1 Map of sites and samples


1.0.1 Prior mislabeling of samples from ONF

The Kansas site ONF was labeled as ‘South Middle’ in factor ‘Bin’ in previous manuscripts. It should be ‘North’.

Map of sites colored by latitudinal bin.



2 Overview of Site clusters (NMDS by Site)


I wanted to see whether samples group by geographic location. Because the sites vary widely in their distance from one another, I assessed this visually using a color scheme that roughly represents each site's location in 2D space. See the map below:



In the NMDS (Bray-Curtis), we see definite clustering by Site, but not all groupings are distinct.

We also see evidence of potentially significant spatial relationships (i.e., distance-decay).

2.1 All in 1 plot


2.2 Facet by Lat bin


2.3 Facet by Long bin


2.4 Lat:Long


3 More in-depth: distance-based clustering


Background: Microbiome sample clustering can be performed using either model-based methods or machine learning methods.
- Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods (“due to their efficient implementation and easy interpretation.”)
- I used the partitioning around medoids (PAM) clustering method, which is related to K-means but considered more robust. K-means minimizes the sum of squared distances to the cluster centers, which makes it sensitive to outliers; PAM instead minimizes the sum of distances to the medoids.
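
For reference, the two optimization goals can be written side by side. This is a notational sketch only; the symbols are introduced here for illustration: $d$ is the chosen dissimilarity (e.g., Bray-Curtis or Sørensen), $C_1, \dots, C_k$ are the clusters, $m_j$ is the medoid of cluster $j$ (an actual sample from the data set $X$), and $\mu_j$ is the centroid of cluster $j$.

```latex
\text{PAM:}\quad
\min_{m_1,\dots,m_k \in X}\ \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, m_j)
\qquad\qquad
\text{K-means:}\quad
\min_{\mu_1,\dots,\mu_k}\ \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```

Because the PAM objective only needs pairwise dissimilarities $d(x_i, m_j)$, it can be evaluated from a precomputed distance matrix, whereas the K-means centroid $\mu_j$ requires the original coordinates.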

Note: clustering was performed directly on the distance matrices, not on ordinations or ordination scores.


3.1 Gap Statistic (on distance matrix)
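
The report does not show how the gap statistic was computed, so the following is only a minimal sketch of one possible approach using `cluster::clusGap()`. It rests on a loud assumption: the distance matrix is passed as an ordinary numeric matrix, so each sample is described by its dissimilarities to all other samples, and `clusGap()` then computes Euclidean distances between those rows internally. The toy data, `toy_distmat`, and `pam_wrapper` are hypothetical names used for illustration.

```r
library(cluster)  # pam(), clusGap(), maxSE()

set.seed(42)  # clusGap() draws random reference data sets

# Hypothetical toy data: 30 samples in three well-separated groups
toy <- rbind(
  matrix(rnorm(50, mean = 0, sd = 0.3), ncol = 5),
  matrix(rnorm(50, mean = 4, sd = 0.3), ncol = 5),
  matrix(rnorm(50, mean = 8, sd = 0.3), ncol = 5)
)
# Stand-in for a community dissimilarity matrix such as Fun_wc_bray_distmat
toy_distmat <- as.matrix(dist(toy))

# clusGap() expects function(x, k) returning a list with a `cluster` element
pam_wrapper <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

# Assumption: rows of the distance matrix are treated as feature vectors
gap_toy <- clusGap(toy_distmat, FUNcluster = pam_wrapper, K.max = 6, B = 50,
                   verbose = FALSE)

# One of several selection rules offered by cluster::maxSE()
best_k <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
                method = "firstSEmax")
best_k
```

Different selection rules in `maxSE()` can suggest different values of k for the same gap curve, which may be one reason to inspect more than one k, as was done for the Sørensen matrix (k = 10 and k = 30).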


view gap on bray


view gap on sørensen


3.2 perform PAM

# Perform PAM clustering directly on the dissimilarity matrices
library(cluster)  # provides pam()

# diss = TRUE: input is a dissimilarity matrix; cluster.only = TRUE: return only the cluster assignments
pam_fwc_bc_k4 <- pam(Fun_wc_bray_distmat, k = 4, diss = TRUE, cluster.only = TRUE)

pam_fwc_sor_k10 <- pam(Fun_wc_sorensen_distmat, k = 10, diss = TRUE, cluster.only = TRUE)
pam_fwc_sor_k30 <- pam(Fun_wc_sorensen_distmat, k = 30, diss = TRUE, cluster.only = TRUE)

# Save the cluster assignments for reuse
saveRDS(pam_fwc_bc_k4, file = "../processed_data/clean_rds_saves/pam/pam_fwc_bc_k4.rds")
saveRDS(pam_fwc_sor_k10, file = "../processed_data/clean_rds_saves/pam/pam_fwc_sor_k10.rds")
saveRDS(pam_fwc_sor_k30, file = "../processed_data/clean_rds_saves/pam/pam_fwc_sor_k30.rds")


3.3 view clusters on NMDS


Sørensen-based clusters


k = 10

Note the density of cluster 1 - I’ll investigate that further.

Cluster descriptions

Descriptions of Sørensen distance-based clusters. †variables listed are significant in all vs. base-mean Wilcox test with BH p-value corrections. * p-value < 0.05, ** < 0.001, *** < 0.0001, **** < 1e-5.
| Cluster | Total n | Site(s) | Grass(es) | Characteristics† | Exclusivity |
|---|---|---|---|---|---|
| 1 | 183 | | | spans pH range | |
| 2 | 37 | BNP, DMT, FMT, SEV | BOER (n=33), BOGR (n=4) | high pH* (6.8 - 8.3, mean 7.5) | |
| 3 | 56 | | | high pH**** (6.8 - 8.3, mean 7.8) | |
| 4 | 17 | | | lower pH**** (5.6 - 7.6, mean 6.2) | |
| 5 | 43 | | | high pH** (7.1 - 7.8, mean 7.6) | |
| 6 | 32 | | | lower pH**** (5.1 - 7.8, mean 6.2) | |
| 7 | 38 | | | | |
| 8 | 29 | mostly LAR (n=27) | | high pH**** (7.2 - 8.2, mean 8.0) | |
| 9 | 9 | SFA | SCSC (only grass present) | | Site=SFA |
| 10 | 29 | KAE | | lower pH**** (5.9 - 7.2, mean 6.3) | Site=KAE (of the 32 KAE samples, only 3 others were in diff clusters) |
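
The significance marks in the Characteristics column come from per-cluster Wilcoxon tests with BH correction. The sketch below shows how such a comparison could be set up in base R. It assumes that "all vs. base-mean" means each cluster is compared against all samples pooled; the toy data frame and its values are hypothetical and are not the study data.

```r
set.seed(7)

# Hypothetical toy metadata: soil pH for three clusters of 20 samples each
toy <- data.frame(
  cluster = factor(rep(1:3, each = 20)),
  pH = c(rnorm(20, mean = 7.0, sd = 0.3),
         rnorm(20, mean = 7.8, sd = 0.3),
         rnorm(20, mean = 6.2, sd = 0.3))
)

# Compare each cluster against all samples pooled (assumed reading of
# "all vs. base-mean"); exact = FALSE because the pooled vector contains ties
raw_p <- sapply(levels(toy$cluster), function(g) {
  wilcox.test(toy$pH[toy$cluster == g], toy$pH, exact = FALSE)$p.value
})

# Benjamini-Hochberg correction across the per-cluster tests
adj_p <- p.adjust(raw_p, method = "BH")

data.frame(cluster = names(raw_p), p = raw_p, p_BH = adj_p)
```

The BH-adjusted p-values are never smaller than the raw ones, so a cluster flagged after correction was also nominally significant before it.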


All samples


Facet by Lat


Facet by Long


We do see clusters with only 1 site and others with multiple sites.


Bray-Curtis


k = 4

All samples


Facet by Lat


Facet by Long


3.4 Are the clusters distinct/distinguishable?

Clusters: Sørensen dissimilarity clusters based on pam (k = 10)

Method: R package randomForest v4.7.1.1

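library(phyloseq)      # otu_table(), sample_data()
library(randomForest)  # randomForest()

# Predictors: OTU table transposed so that samples are rows (assumes taxa are rows in the phyloseq object)
predictors.all <- t(otu_table(Fun_wholecommunity))

# Response: Sørensen-based PAM cluster assignment (k = 10) of each sample
response.clus_sor_k10 <- as.factor(sample_data(Fun_wholecommunity)$clus_sor_k10)

rf.data.clus_sor_k10 <- data.frame(response.clus_sor_k10, predictors.all)

classify.clus_sor_k10 <- randomForest(response.clus_sor_k10 ~ ., data = rf.data.clus_sor_k10, ntree = 999)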

Call: randomForest(formula = response.clus_sor_k10 ~ ., data = rf.data.clus_sor_k10, ntree = 999)

Type of random forest: classification

Number of trees: 999

No. of variables tried at each split: 81

OOB estimate of error rate: 6.82%

Confusion matrix of fungal whole community randomForest classification based on Sørensen-based PAM clusters. Diagonals are accurate calls. Misclassifications are shown as values by row, i.e., cluster 2 was misclassified as cluster 3 in 3 samples, while 34 samples were accurately called, resulting in an error rate of 8.1% for that cluster.
1 2 3 4 5 6 7 8 9 10 Cluster error %
1 183 0.0
2 34 3 8.1
3 5 50 1 10.7
4 12 4 1 29.4
5 1 50 1 1 5.7
6 7 23 2 28.1
7 1 2 34 1 10.5
8 1 27 1 6.9
9 10 0
10 1 28 3.4
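
The per-cluster error in the last column is the share of a row's samples that fall off the diagonal. A minimal sketch of that arithmetic follows; `class_error` and `conf_toy` are hypothetical names, and only row "2" reuses counts quoted in the caption (34 correct calls, 3 samples called as cluster 3). The other entries are made up for illustration. A fitted `randomForest` object also reports this quantity as the `class.error` column of its `confusion` component.

```r
# Per-class error from a confusion matrix (rows = true class, columns = predicted)
class_error <- function(conf) 1 - diag(conf) / rowSums(conf)

# Hypothetical 3-class matrix; row "2" uses the counts quoted in the caption,
# the remaining entries are invented for illustration
conf_toy <- matrix(
  c(20,  0,  0,
     0, 34,  3,
     1,  0, 19),
  nrow = 3, byrow = TRUE,
  dimnames = list(true = c("1", "2", "3"), predicted = c("1", "2", "3"))
)

# Error rate per true class, in percent
err_pct <- round(100 * class_error(conf_toy), 1)
err_pct
```

For row "2" this gives 3 / (34 + 3), i.e. the 8.1% quoted in the caption.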

Notes: OTU205 has its highest relative abundance in Cluster 1; most other clusters have minimal or no presence. The exception is the sample in Cluster 7 that was misclassified into Cluster 1.

OTU6679, Generalist.test: all (global RA = 1.65%) v. Cluster 1 (mean RA = 3.75%)


3.5 Dendrogram of Sørensen-based clusters (k = 10)


Fungal whole community

Dendrogram of Sørensen-based clusters - Fungal whole community

Clusters 2 & 3 seem to be the outgroup / most distantly related
Clusters 1 & 9 are very similar, followed by 10 -> wrap-around cluster numbering

3.6 Mapping metadata


3.6.1 Edaphic


pH


soil moisture


SOM


ammonium


phosphorus


3.6.2 Climate


Precipitation

ppt3yr = mean annual precipitation (determined over a 3-year window)


Growing degree days

GDD3yr = growing degree days (determined over a 3-year window)


3.6.3 Plant Traits


Specific root length

avg_SRL = specific root length


Specific leaf area

avg_SLA = specific leaf area


Herbivory percent

herbivory_perc = site-level herbivory damage estimate, averaged over all species and individuals at the site; could be used to indicate herbivory pressure at the site level. Ranges from 1 to 17%.


3.6.4 Factors


Longitude (longitudinal ‘gradients’)